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\ Abstract. Sparse estimation methods are aimed at using or obtaining 

CN ■ parsimonious representations of data or models. While naturally cast 

5-H . as a combinatorial optimization problem, variable or feature selection 

admits a convex relaxation through the regularization by the £i-norm. 
In this paper, we consider situations where we are not only interested in 
sparsity, but where some structural prior knowledge is available as well. 
We show that the ^i-norm can then be extended to structured norms 
built on either disjoint or overlapping groups of variables, leading to a 
flexible framework that can deal with various structures. We present 
applications to unsupervised learning, for structured sparse principal 
component analysis and hierarchical dictionary learning, and to super- 
vised learning in the context of non-linear variable selection. 
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The concept of parsimony is central in many scientific domains. In the context 
of statistics, signal processing or machine learning, it takes the form of variable 
or feature selection problems, and is commonly used in two situations: First, to 
make the model or the prediction more interpretable or cheaper to use, i.e., even 
if the underlying problem does not admit sparse solutions, one looks for the best 
sparse approximation. Second, sparsity can also be used given prior knowledge 



^ ' that the model should be sparse. 



Sparse linear models seek to predict an output by linearly combining a small 
subset of the features describing the data. To simultaneously address variable 
selection and model estimation, ^i-norm regularization has become a popular 
tool, which benefits both from efficient algorithms (see, e.g., Efron et al., 2004; 
Beck and Teboulle, 2009; Yuan, 2010; Bach et al., 2012, and multiple references 
therein) and a well-developed theory for generalization properties and variable 
selection consistency (Zhao and Yu, 2006; Wainwright, 2009; Bickel, Ritov and 
Tsybakov, 2009; Zhang, 2009). 
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When regularizing with the £i-norm, each variable is selected individually, re- 
gardless of its position in the input feature vector, so that existing relationships 
and structures between the variables (e.g., spatial, hierarchical or related to the 
physics of the problem at hand) are merely disregarded. However, in many prac- 
tical situations the estimation can benefit from some type of prior knowledge, 
potentially both for interpretability and to improve predictive performance. 

This a priori can take various forms: in neuroimaging based on functional 
magnetic resonance (fMRI) or magnetoencephalography (MEG), sets of voxels 
allowing to discriminate between different brain states are expected to form small 
localized and connected areas (Gramfort and Kowalski, 2009; Xiang et al., 2009, 
and references therein). Similarly, in face recognition, as shown in Section 4.4, 
robustness to occlusions can be increased by considering as features, sets of pixels 
that form small convex regions of the faces. Again, a plain ^i-norm regularization 
fails to encode such specific spatial constraints (Jenatton, Obozinski and Bach, 

2010) . The same rationale supports the use of structured sparsity for background 
subtraction (Cevher et al., 2008; Huang, Zhang and Metaxas, 2011; Mairal et al., 

2011) . 

Another example of the need for higher-order prior knowledge comes from 
bioinformatics. Indeed, for the diagnosis of tumors, the profiles of array-based 
comparative genomic hybridization (arrayCGH) can be used as inputs to feed a 
classifier (Rapaport, Barillot and Vert, 2008). These profiles are characterized by 
many variables, but only a few observations of such profiles are available, prompt- 
ing the need for variable selection. Because of the specific spatial organization 
of bacterial artificial chromosomes along the genome, the set of discriminative 
features is expected to consist of specific contiguous patterns. Using this prior 
knowledge in addition to standard sparsity leads to improvement in classification 
accuracy (Rapaport, Barillot and Vert, 2008). In the context of multi-task re- 
gression, a problem of interest in genetics is to find a mapping between a small 
subset of loci presenting single nucleotide polymorphisms (SNP's) that have a 
phenotypic impact on a given family of genes (Kim and Xing, 2010). This target 
family of genes has its own structure, where some genes share common genetic 
characteristics, so that these genes can be embedded into some underlying hier- 
archy. Exploiting directly this hierarchical information in the regularization term 
outperforms the unstructured approach with a standard ^i-norm (Kim and Xing, 
2010). 

These real world examples motivate the need for the design of sparsity-inducing 
regularization schemes, capable of encoding more sophisticated prior knowledge 
about the expected sparsity patterns. As mentioned above, the ^i-norm corre- 
sponds only to a constraint on cardinality and is oblivious of any other informa- 
tion available about the patterns of nonzero coefficients ("nonzero patterns" or 
"supports") induced in the solution, since they are all theoretically possible. In 
this paper, we consider a family of sparsity-inducing norms that can address a 
large variety of structured sparse problems: a simple change of norm will induce 
new ways of selecting variables; moreover, as shown in Section 3.5 and Section 3.6, 
algorithms to obtain estimators (e.g., convex optimization methods) and theo- 
retical analyses are easily extended in many situations. As shown in Section 3, 
the norms we introduce generalize traditional "group £i-norms", that have been 
popular for selecting variables organized in non-overlapping groups (Turlach, Ven- 
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ables and Wright, 2005; Yuan and Lin, 2006; Roth and Fischer, 2008; Huang and 
Zhang, 2010). Other famihes for different types of structures are presented in 
Section 3.4. 

The paper is organized as follows: we first review in Section 2 classical £i- 
norm regularization in supervised contexts. We then introduce several families of 
norms in Section 3, and present applications to unsupervised learning in Section 4, 
namely for sparse principal component analysis in Section 4.4 and hierarchical 
dictionary learning in Section 4.5. We briefly show in Section 5 how these norms 
can also be used for high-dimensional non-linear variable selection. 

Notations. Throughout the paper, we shall denote vectors with bold lower 
case letters, and matrices with bold upper case ones. For any integer j in the set 
|l;p] = {1, . . . we denote the j-th coefficient of a p-dimensional vector w G 
by Wj. Similarly, for any matrix W E M^^^, we refer to the entry on the i-th row 
and j-th column as Wjj, for any G [[1;?t-]] x [[l;p]. We will need to refer to 

sub-vectors of w G M^, and so, for any J C [l;p], we denote by wj € RI'^I the 
vector consisting of the entries of w indexed by J. Likewise, for any / C |{l;n], 
J CI [[l;p]], we denote by W/j G mI-^^I^I-^I the sub-matrix of W formed by the 
rows (respectively the columns) indexed by / (respectively by J). We extensively 
manipulate norms in this paper. We thus define the iq-norm for any vector w € 
by ||w||^ = |wj|'? for q G [l,oo), and ||vi^||oo - maxjgp.p] |v^^j|. For q £ 

(0, 1), we extend the definition above to iq pseudo-norms. Finally, for any matrix 
W G ]R"^P, we define the Probenius norm of W by ||W||2 4 ^^^^ Y.%i^ij- 

2. UNSTRUCTURED SPARSITY VIA THE £i-NORM 

Regularizing by the £i-norm has been a topic of intensive research over the last 
decade. This line of work has witnessed the development of nice theoretical frame- 
works (Tibshirani, 1996; Chen, Donoho and Saunders, 1998; Mallat, 1999; Tropp, 
2004, 2006; Zhao and Yu, 2006; Zou, 2006; Wainwright, 2009; Bickel, Ritov and 
Tsybakov, 2009; Zhang, 2009; Negahban et al., 2009) and the emergence of many 
efficient algorithms (Efron et al., 2004; Nesterov, 2007; Friedman et al., 2007; 
Wu and Lange, 2008; Beck and Teboulle, 2009; Wright, Nowak and Figueiredo, 
2009; Needeh and Tropp, 2009; Yuan et al, 2010). Moreover, this methodology 
has found quite a few applications, notably in compressed sensing (Candes and 
Tao, 2005), for the estimation of the structure of graphical models (Meinshausen 
and Biihlmann, 2006) or for several reconstruction tasks involving natural im- 
ages (e.g., see Mairal, 2010, for a review). In this section, we focus on supervised 
learning and present the traditional estimation problems associated with sparsity- 
inducing norms such as the ^i-norm (see Section 4 for unsupervised learning). 

In supervised learning, we predict (typically one-dimensional) outputs y in 3^ 
from observations x'm X; these observations are usually represented by p-dimen- 
sional vectors with X = MP. M-estimation and in particular regularized empirical 
risk minimization are well suited to this setting. Indeed, given n pairs of data 
points {(xW,yW) G M^'xJ^; i=l,...,n}, we consider the estimators solving the 
following form of convex optimization problem 



where £ is a loss function and : — )• M is a sparsity-inducing — typically non- 



(2.1) 
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smooth and non-Euclidean — norm. Typical examples of difFerentiable loss func- 
tions are the square loss for least squares regression, i.e., y) = \{y — y)'^ with y 
in R, and the logistic loss i{y,y) = log(l + e~^^) for logistic regression, with y 
in {—1, 1}. We refer the readers to Shawe- Taylor and Cristianini (2004) and to 
Hastie, Tibshirani and Friedman (2001) for more complete descriptions of loss 
functions. 

Within the context of least-squares regression, £i-norm regularization is known 
as the Lasso (Tibshirani, 1996) in statistics and as basis pursuit in signal process- 
ing (Chen, Donoho and Saunders, 1998). For the Lasso, formulation (2.1) takes 
the form 

(2.2) min — lly — Xwllo + A||w||i , 



and, equivalently, basis pursuit can be written^ 

(2.3) min -llx - Dall^ + A||q;||i. 

amp 2 

These two equations are obviously identical but we write them both to show the 
correspondence between notations used in statistics and in signal processing. In 
statistical notations, we will use X € R'^^^ to denote a set of n observations de- 
scribed by p variables (covariates), while y G R" represents the corresponding set 
of n targets (responses) that we try to predict. For instance, y may have discrete 
entries in the context of classification. With notations of signal processing, we will 
consider an m-dimensional signal x G R"^ that we express as a linear combina- 
tion of p dictionary elements composing the dictionary D = [d"*^, . . . , d^] S W^^^. 
While the design matrix X is usually assumed fixed and given beforehand, we 
shall see in Section 4 that the dictionary D may correspond either to some pre- 
defined basis (e.g., see Mallat, 1999, for wavelet bases) or to a representation that 
is actually learned as well (Olshausen and Field, 1996). 

Geometric intuitions for the ii-norm ball. While we consider in (2.1) a regu- 
larized formulation, we could have considered an equivalent constrained problem 
of the form 

1 " 

(2.4) min £(?/(*), w'^xW) such that (7(w) < 

i=l 

for some /i £ R+: It is indeed the case that the solutions to problem (2.4) obtained 
when varying fi is the same as the solutions to problem (2.1), for some of 
depending on fi (e.g., see Section 3.2 in Borwein and Lewis, 2006). 

At optimality, the opposite of the gradient of / : w i— )• ^ S"=i ^(y^*\ w~''x(*)) 
evaluated at any solution w of (2.4) must belong to the normal cone to ;S = {w G 
M.P; r2(w) < at w (Borwein and Lewis, 2006). In other words, for sufficiently 
small values of (i.e., ensuring that the constraint is active) the level set of / for 
the value /(w) is tangent to B. As a consequence, important properties of the 



^Note that the formulations which are typically encountered in signal processing are either 
mine [|Q!||i s.t. x — Da, which corresponds to the limiting case of Eq. (2.3) where A ^ and 
X is in the span of the dictionary D, or mine ||a||i s.t. ||x — Dq:||2 < J? which is a constrained 
counterpart of Eq. (2.3) leading to the same set of solutions (see the explanation following 
Eq. (2.4)). 
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solutions w follow from the geometry of the ball B. If il. is taken to be the £2- 
norm, then the resulting ball B is the standard, isotropic, "round" ball that does 
not favor any specific direction of the space. On the other hand, when Q is the 
£i-norm, B corresponds to a diamond-shaped pattern in two dimensions, and to 
a double pyramid in three dimensions. In particular, B is anisotropic and exhibits 
some singular points due to the non-smoothness of fi. Since these singular points 
are located along axis-aligned linear subspaces in M^, if the level set of / with the 
smallest feasible value is tangent to B at one of those points, sparse solutions are 
obtained. We display on Figure 1 the balls B for both the £1- and ^2-iiorms. See 
Section 3 and Figure 2 for extensions to structured norms. 

e 



(a) ^2-norm ball (b) ^i-norm ball 

Fig 1: Comparison between the ^2-iiorm and ^i-norm balls in three dimensions, 
respectively on the left and right figures. The £i-norm ball presents some singular 
points located along the axes of and along the three axis-aligned planes going 
through the origin. 

3. STRUCTURED SPARSITY-INDUCING NORMS 

In this section, we consider structured sparsity-inducing norms that induce 
estimated vectors that are not only sparse, as for the ^i-norm, but whose support 
also displays some structure known a priori that reflects potential relationships 
between the variables. 

3.1 Sparsity-inducing Norms with Disjoint Groups of Variables 

The most natural form of structured sparsity is arguably group sparsity, match- 
ing the a priori knowledge that pre-specified disjoint blocks of variables should be 
selected or ignored simultaneously. In that case, if ^ is a collection of groups of 
variables, forming a partition of [I;]?]], and dg is a positive scalar weight indexed 
by group g, we define as 

(3.1) r2(w) = fig||wg||g for any (/ G (1, 00]. 

This norm is usually referred to as a mixed ii/iq-noim, and in practice, popular 
choices for q are {2, 00}. As desired, regularizing with leads variables in the 
same group to be selected or set to zero simultaneously (see Figure 2 for a geomet- 
ric interpretation). In the context of least-squares regression, this regularization 

imsart-sts ver. 2011/05/20 file: stat_science_structured_sparsity_final_version.tex date: April 23, 2012 




6 



BACH ET AL. 





(a) €i/^2-norm ball without over- 
laps: il(-w) = ||W{i,2}||2 + |W3 



(b) ^i/^2-norm ball with overlaps: 

f2(w) = ||W{i^2,3} II2 + |wi| + |W2| 



Fig 2: Comparison between two mixed ii/i2-norm balls in three dimensions (the 
first two directions are horizontal, the third one is vertical), without and with 
overlapping groups of variables, respectively on the left and right figures. The 
singular points appearing on these balls describe the sparsity-inducing behavior 
of the underlying norms Q. 

is known as the group Lasso (Turlach, Venables and Wright, 2005; Yuan and Lin, 
2006). It has been shown to improve the prediction performance and/or inter- 
pretability of the learned models when the block structure is relevant (Roth and 
Fischer, 2008; Stojnic, Parvaresh and Hassibi, 2009; Lounici et al., 2009; Huang 
and Zhang, 2010). Moreover, applications of this regularization scheme arise also 
in the context of multi-task learning (Obozinski, Taskar and Jordan, 2010; Quat- 
toni et al., 2009; Liu, Palatucci and Zhang, 2009) to account for features shared 
across tasks, and multiple kernel learning (Bach, 2008) for the selection of differ- 
ent kernels (see also Section 5). 

Choice of the weights. When the groups vary significantly in size, results can be 
improved, in particular under high-dimensional scaling, by an appropriate choice 
of the weights dg which compensate for the discrepancies of sizes between groups. 
It is difficult to provide a unique choice for the weights. In general, they depend 
on q and on the type of consistency desired. We refer the reader to Yuan and Lin 
(2006); Bach (2008); Obozinski, Jacob and Vert (2011); Lounici et al. (2011) for 
general discussions. 

It might seem that the case of groups that overlap would be unnecessarily 
complex. It turns out, in reality, that appropriate collections of overlapping groups 
allow to encode quite interesting forms of structured sparsity. In fact, the idea 
of constructing sparsity-inducing norms from overlapping groups will be key. We 
present two different constructions based on overlapping groups of variables that 
are essentially complementary of each other in Sections 3.2 and 3.3. 

3.2 Sparsity-inducing Norms with Overlapping Groups of Variables 

In this section, we consider a direct extension of the norm introduced in the 
previous section to the case of overlapping groups; we give an informal overview 
of the structures that it can encode and examples of relevant applied settings. 
For more details see Jenatton, Audibert and Bach (2011). 
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Starting from the definition of in Eq. (3.1), it is natural to study what 
happens when the set of groups Q is ahowed to contain elements that overlap. In 
fact, and as shown by Jenatton, Audibert and Bach (2011), the sparsity-inducing 
behavior of remains the same: when regularizing by J7, some entire groups of 
variables g in Q are set to zero. This is reflected in the set of non-smooth extreme 
points of the unit ball of the norm represented on Figure 2-(b). While the resulting 
patterns of nonzero variables — also referred to as supports, or nonzero patterns — 
were obvious in the non-overlapping case, it is interesting to understand here 
the relationship that ties together the set of groups G and its associated set of 
possible nonzero patterns. Let us denote by V the latter set. For any norm of 
the form (3.1), it is still the case that variables belonging to a given group are 
encouraged to be set simultaneously to zero; as a result, the possible zero patterns 
for solutions of (2.1) are obtained by forming unions of the basic groups, which 
means that the possible supports are obtained by taking the intersection of a 
certain number of complements of the basic groups. 

Moreover, under mild conditions (Jenatton, Audibert and Bach, 2011), given 
any intersection-closed'^ family of patterns V of variables (see examples below), 
it is possible to build an ad-hoc set of groups Q — and hence, a regularization 
norm ft — that enforces the support of the solutions of (2.1) to belong to V. 

These properties make it possible to design norms that are adapted to the 
structure of the problem at hand, which we now illustrate with a few examples. 

One-dimensional interval pattern. Given p variables organized in a sequence, 
using the set of groups G of Figure 3, it is only possible to select contiguous 
nonzero patterns. In this case, we have \G\ = 0{p). Imposing the contiguity of the 
nonzero patterns can be relevant in the context of variable forming time series, 
or for the diagnosis of tumors, based on the profiles of CGH arrays (Rapaport, 
Barillot and Vert, 2008), since a bacterial artificial chromosome will be inserted 
as a single continuous block into the genome. 




Fig 3: (Left) The set of blue groups to penalize in order to select contiguous 
patterns in a sequence. (Right) In red, an example of such a nonzero pattern 
with its corresponding zero pattern (hatched area). 



Two-dimensional convex support. Similarly, assume now that the p variables 
are organized on a two-dimensional grid. To constrain the allowed supports V to 
be the set of all rectangles on this grid, a possible set of groups G to consider is 
represented in the top of Figure 4. This set is relatively small since \G\ = 0{y/p). 
Groups corresponding to half-planes with additional orientations (see Figure 4 
bottom) may be added to "carve out" more general convex patterns. See an 
illustration in Section 4.4. 

set A is said to be intersection-closed, if for any fc £ N, and for any (ai, . . . , Uk) € A'' , we 
have HiLiii G -4. 
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H : 







Fig 4: Top: Vertical and horizontal groups: (Left) the set of blue and green groups 
to penalize in order to select rectangles. (Right) In red, an example of nonzero 
pattern recovered in this setting, with its corresponding zero pattern (hatched 
area). Bottom: Groups with ±-k/4 orientations: (Left) the set of blue and green 
groups with their (not displayed) complements to penalize in order to select 
diamond-shaped patterns. 



Two-dimensional block structures on a grid. Using sparsity-inducing regular- 
izations built upon groups which are composed of variables together with their 
spatial neighbors leads to good performances for background subtraction (Cevher 
et al., 2008; Baraniuk et al., 2010; Huang, Zhang and Metaxas, 2011; Mairal et al., 
2011), topographic dictionary learning (Kavukcuoglu et al., 2009; Mairal et al., 
2011), and wavelet-based denoising (Rao et al., 2011). 

Hierarchical structure. A fourth interesting example assumes that the variables 
are organized in a hierarchy. Precisely, we assume that the p variables can be 
assigned to the nodes of a tree T (or a forest of trees) , and that a given variable 
may be selected only if all its ancestors in T have already been selected. This 
hierarchical rule is exactly respected when using the family of groups displayed 
on Figure 5. The corresponding penalty was first used by Zhao, Rocha and Yu 
(2009); one of it simplest instance in the context of regression is the sparse group 
Lasso (Sprechmann et al., 2010; Friedman, Hastie and Tibshirani, 2010); it has 
found numerous applications, for instance, wavelet-based denoising (Zhao, Rocha 
and Yu, 2009; Baraniuk et al., 2010; Huang, Zhang and Metaxas, 2011; Jenatton 
et al., 2011b), hierarchical dictionary learning for both topic modelling and image 
restoration (Jenatton et al., 2011b), log-linear models for the selection of potential 
orders (Schmidt and Murphy, 2010), bioinformatics, to exploit the tree structure 
of gene networks for multi-task regression (Kim and Xing, 2010), and multi-scale 
mining of fMRI data for the prediction of simple cognitive tasks (Jenatton et al., 
2011a). 

Extensions. Possible choices for the sets of groups G are not limited to the 
aforementioned examples: more complicated topologies can be considered, for 
example three-dimensional spaces discretized in cubes or spherical volumes dis- 
cretized in slices (see an application to neuroimaging by Varoquaux et al. (2010)), 
and more complicated hierarchical structures based on directed acyclic graphs can 
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Fig 5: Left: set of groups Q (dashed contours in red) corresponding to the tree T 
with p = 6 nodes represented by black circles. Right: example of a sparsity pattern 
induced by the tree-structured norm corresponding to Q: the groups {2, 4}, {4} 
and {6} are set to zero, so that the corresponding nodes (in gray) that form 
subtrees of T are removed. The remaining nonzero variables {1,3,5} form a 
rooted and connected subtree of T- This sparsity pattern obeys the two following 
equivalent rules: (i) if a node is selected, the same goes for all its ancestors, (ii) 
if a node is not selected, then its descendant are not selected. 



be encoded as further developed in Section 5. 

Choice of the weights. The choice of the weights dg is significantly more im- 
portant in the overlapping case both theoretically and in practice. In addition to 
compensating for the discrepancy in group sizes, the weights additionally have to 
make up for the potential over-penalization of parameters contained in a larger 
number of groups. For the case of one-dimensional interval patterns, Jenatton, 
Audibert and Bach (2011) showed that it was more efficient in practice to actu- 
ally weight each individual coefficient inside of a group as opposed to weighting 
the group globally. 

3.3 Norms for Overlapping Groups: a Latent Variable Formulation 

The family of norms defined in Eq. (3.1) is adapted to intersection-closed sets 
of nonzero patterns. However, some applications exhibit structures that can be 
more naturally modelled by union-closed families of supports. This idea was 
introduced by Jacob, Obozinski and Vert (2009) and Obozinski, Jacob and Vert 
(2011) who, given a set of groups Q, proposed the following norm 

(3.2) ^^union(w) ^ min y^dg\\^r'}\\g such that J J^^C"' 1, "^'^ . ^ 

where again dg is a positive scalar weight associated with group g. 

The norm we just defined provides a different generalization of the ii/iq-norm 
to the case of overlapping groups than the norm presented in Section 3.2. In fact, 
it is easy to see that solving Eq. (2.1) with the norm reunion is equivalent to solving 



^91 



and setting w = X^^gg . This last equation shows that using the norm Ounion 
can be interpreted as implicitly duplicating the variables belonging to several 
groups and regularizing with a weighted ^i/£g-norm for disjoint groups in the 
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expanded space. Again in this case a careful choice of the weights is important 
(Obozinski, Jacob and Vert, 2011). 

This latent variable formulation pushes some of the vectors to zero while 
keeping others with no zero components, hence leading to a vector w with a 
support which is in general the union of the selected groups. Interestingly, it 
can be seen as a convex relaxation of a non-convex penalty encouraging similar 
sparsity patterns which was introduced by Huang, Zhang and Metaxas (2011) 
and which we present in Section 3.4. 

Graph Lasso. One type of a priori knowledge commonly encountered takes the 
form of a graph defined on the set of input variables, which is such that connected 
variables are more likely to be simultaneously relevant or irrelevant; this type 
of prior is common in genomics where regulation, co-expression or interaction 
networks between genes (or their expression level) used as predictors are often 
available. To favor the selection of neighbors of a selected variable, it is possible 
to consider the edges of the graph as groups in the previous formulation (see 
Jacob, Obozinski and Vert, 2009; Rao et al., 2011). 

Patterns consisting of a small number of intervals. A quite similar situation 
occurs, when one knows a priori — typically for variables forming sequences (times 
series, strings, polymers) — that the support should consist of a small number 
of connected subsequences. In that case, one can consider the sets of variables 
forming connected subsequences (or connected subsequences of length at most k) 
as the overlapping groups (Obozinski, Jacob and Vert, 2011). 




(a) Unit ball for G = {{1, 3}, {2, 3}} (b) Q = {{1, 3}, {2, 3}, {1, 2}} 

Fig 6: Two instances of unit balls of the latent group Lasso regularization rJunion 
for two or three groups of two variables. Their singular points lie on axis aligned 
circles, corresponding to each group, and whose convex hull is exactly the unit 
ball. It should be noted that the ball on the left is quite similar to the one of 
Fig. 2b except that its "poles" are flatter, hence discouraging the selection of X3 
without either xi or X2. 

3.4 Related Approaches to Structured Sparsity 

Norm design through suhmodular functions. Another approach to structured 
sparsity relies on submodular analysis (Bach, 2010). Starting from a non-de- 



imsart-sts ver. 2011/05/20 file: stat_science_structured_sparsity_final_versioii.tex date: April 23, 2012 



STRUCTURED SPARSITY THROUGH CONVEX OPTIMIZATION 



11 



creasing, submodular"^ set-function F of the supports of the parameter vector 
w — i.e., w I— 7- F{{i G Wj ^ 0}) — a structured sparsity-inducing norm 

can be built by considering its convex envelope (tightest convex lower bound) 
on the unit £oo-norm ball. By selecting the appropriate set-function F, similar 
structures to those described above can be obtained. This idea can be further 
extended to symmetric, submodular set-functions of the level sets of w, that 
is, I— maX;^g]KF({j G Wj > i/}), thus leading to different types of 

structures (Bach, 2011), allowing to shape the level sets of w rather than its 
support. This approach can also be generalized to any set-function and other 
priors on the the non-zero variables than the £oo-norm (Obozinski and Bach, 



where ^ is a given set of groups, and {wgjggg is a set of positive weights which de- 
fines a coding length. In other words, the penalty ip measures from an information- 
theoretic viewpoint, "how much it costs" to represent w. Finally, in the context 
of compressed sensing, the work of Baraniuk et al. (2010) also focuses on union- 
closed families of supports, although without information-theoretic considera- 
tions. All of these non-convex approaches can in fact also be relaxed to convex 
optimization problems (Obozinski and Bach, 2012). 

Other forms of sparsity. We end this review by discussing sparse regulariza- 
tion functions encoding other types of structures than the structured sparsity 
penalties we have presented. We start with the total-variation penalty originally 
introduced in the image processing community (Rudin, Osher and Fatemi, 1992), 
which encourages piecewise constant signals. It can be found in the statistics 
literature under the name of "fused lasso" (Tibshirani et al., 2005). For one- 
dimensional signals, it can be seen as the £i-norm of finite differences for a vec- 
tor w in W: Otv-id(w) = "^2^=1 l^j+i — Wj|. Extensions have been proposed 
for multi-dimensional signals and for recovering piecewise constant functions on 
graphs (Kim, Sohn and Xing, 2009). 

We remark that we have presented group-sparsity penalties in Section 3.1, 
where the goal was to select a few groups of variables. A different approach 
called "exclusive Lasso" consists instead of selecting a few variables inside each 
group, with some applications in multitask learning (Zhou, Jin and Hoi, 2010). 

Finally, we would like to mention a few works on automatic feature group- 
ing (Bondell and Reich, 2008; Shen and Huang, 2010; Zhong and Kwok, 2011), 
which could be used when no a-priori group structure G is available. These penal- 
ties are typically made of pairwise terms between all variables, and encourage 
some coefficients to be similar, thereby forming "groups". 

■^Let S be a finite set. A function f : 2^ — >■ R is said to be submodular if for any subset 
A,B C S, we have the inequality F{A n B) + F{A U B) < F{A) + F{B); see Bach (2011) and 
references therein. 
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Non-convex approaches. We mainly focus in this review on convex penalties 
but in fact many non-convex approaches have been proposed as well. In the same 
spirit as the norm (3.2), Huang, Zhang and Metaxas (2011) considered the penalty 
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3.5 Convex Optimization with Proximal Methods 

In this section, we briefly review proximal methods which are convex optimiza- 
tion methods particularly suited to the norms we have defined. They essentially 
allow to solve the problem regularized with a new norm at low implementation 
and computational costs. For a more complete presentation of optimization tech- 
niques adapted to sparsity-inducing norms, see Bach et al. (2012). 

Proximal methods constitute a class of first-order techniques typically designed 
to solve problem (2.1) (Nesterov, 2007; Beck and Teboulle, 2009; Combettes and 
Pesquet, 2010). They take advantage of the structure of (2.1) as the sum of two 
convex terms. For simplicity, we will present here the proximal method known as 
forward-backward splitting which assumes that at least one of these two terms, 
is smooth. Thus, we will typically assume that the loss function £ is convex 
differentiable, with Lipschitz-continuous gradients (such as the logistic or square 
loss), while will only be assumed convex. 

Proximal methods have become increasingly popular over the past few years, 
both in the signal processing (e.g., Becker, Bobin and Candes, 2009; Wright, 
Nowak and Figueiredo, 2009; Combettes and Pesquet, 2010, and numerous ref- 
erences therein) and in the machine learning communities (e.g., Jenatton et al., 
2011b; Chen et al., 2011; Bach et al., 2012, and references therein). In a broad 
sense, these methods can be described as providing a natural extension of gradient- 
based techniques when the objective function to minimize has a non-smooth part. 
Proximal methods are iterative procedures. Their basic principle is to linearize, at 
each iteration, the function / around the current estimate w, and to update this 
estimate as the (unique, by strong convexity) solution of the so-called proximal 
problem. Under the assumption that / is a smooth function, it takes the form: 

/(w) + (w - w)^V/(w) + Xil.{w) + ^\\w - w\\l . 

The role of the added quadratic term is to keep the update in a neighborhood 
of w where / stays close to its current linear approximation; L > is a parameter 
which is an upper bound on the Lipschitz constant of V/. 

Provided that we can solve efficiently the proximal problem (3.4), this first iter- 
ative scheme constitutes a simple way of solving problem (2.1). It appears under 
various names in the literature: proximal-gradient techniques (Nesterov, 2007), 
forward-backward splitting methods (Combettes and Pesquet, 2010), and itera- 
tive shrinkage-thresholding algorithm (Beck and Teboulle, 2009). Furthermore, 
it is possible to guarantee convergence rates for the function values (Nesterov, 
2007; Beck and Teboulle, 2009), and after k iterations, the precision be shown to 
be of order 0(l/fc), which should contrasted with rates for the subgradient case, 
that are rather 0{1/Vk). 

This first iterative scheme can actually be extended to "accelerated" ver- 
sions (Nesterov, 2007; Beck and Teboulle, 2009). In that case, the update is not 
taken to be exactly the result from (3.4); instead, it is obtained as the solution of 
the proximal problem applied to a well-chosen linear combination of the previous 
estimates. In that case, the function values converge to the optimum with a rate 
of 0(1/A;^), where k is the iteration number. From Nesterov (2004), we know 
that this rate is optimal within the class of first-order techniques; in other words, 
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accelerated proximal-gradient methods can be as fast as without a non-smooth 
component. 

We have so far given an overview of proximal methods, without specifying 
how we precisely handle its core part, namely the computation of the proximal 
problem, as defined in (3.4). 

Proximal Problem. We first rewrite problem (3.4) as 



Under this form, we can readily observe that when A = 0, the solution of the prox- 
imal problem is identical to the standard gradient update rule. The problem above 
can be more generally viewed as an instance of the proximal operator (Moreau, 
1962) associated with Xil.: 



For many choices of regularizers Q, the proximal problem has a closed- form 
solution, which makes proximal methods particularly efficient. It turns out that 
for the norms defined in this paper, we can compute in a large number of cases the 
proximal operator exactly and efficiently (see Bach et al., 2012). If 0, is chosen to 
be the £i-norm, the proximal operator is simply the soft-thresholding operator ap- 
plied elementwise (Donoho and Johnstone, 1995). More formally, we have for all j 
in [[l;p]], Proxjj^ii II Ju]j = sign(uj) max(|uj| — A,0). For the group Lasso penalty of 
Eq. (3.1) with q = 2, the proximal operator is a group-thresholding operator and 
can be also computed in closed form: ProxAn[u]g = (ug/||ug||2) max(||ug||2 — A, 0) 
for all g in Q. For norms with hierarchical groups of variables (in the sense de- 
fined in Section 3.2), the computation of the proximal operator can be obtained 
by a composition of group-thresholding operators in a time linear in the num- 
ber p of variables (Jenatton et al., 2011b). In other settings, e.g., general over- 
lapping groups of £oo-norms, the exact proximal operator implies a more expen- 
sive polynomial dependency on p using network- fiow techniques (Mairal et al., 
2011), but approximate computation is possible without harming the convergence 
speed (Schmidt, Le Roux and Bach, 2011). Most of these norms and the associ- 
ated proximal problems are implemented in the open-source software SPAMS^. 

In summary, with proximal methods, generalizing algorithms from the £i-norm 
to a structured norm requires only to be able to compute the corresponding 
proximal operator, which can be done efficiently in many cases. 

3.6 Theoretical Analysis 

Sparse methods are traditionally analyzed according to three different criteria; 
it is often assumed that the data were generated by a sparse loading vector w*. 
Denoting w a solution of the M-estimation problem in Eq. (2.1), traditional 
statistical consistency results aim at showing that ||w* — w|| is small for a certain 
norm || • ||; model consistency considers the estimation of the support of w* as 
a criterion, while, prediction efficiency only cares about the prediction of the 
model, i.e., with the square loss, the quantity ||Xw* — Xw||2 has to be as small 
as possible. 

''http : //www . di . ens . f r/willow/SPAMS/ 



1 
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Prox^n : u G I— 7- argmin — ||u — v||2 + Ar2(v). 
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A striking consequence of assuming that w* has many zero components is 
that for the three criteria, consistency is achievable even when p is much larger 
than n (Zhao and Yu, 2006; Wainwright, 2009; Bickel, Ritov and Tsybakov, 2009; 
Zhang, 2009). 

However, to relax the often unrealistic assumption that the data are gener- 
ated by a sparse loading vector, and also because a good predictor, especially in 
the high-dimensional setting, can possibly be much sparser than any potential 
true model generating the data, prediction efficiency is often formulated under 
the form of oracle inequalities, where the performance of the estimator is upper 
bounded by the performance of any function in a fixed complexity class, reflecting 
approximation error, plus a complexity term characterizing the class and reflect- 
ing the hardness of estimation in that class. We refer the reader to van de Geer 
(2010) for a review and references on oracle results for the Lasso and the group 
Lasso. 

It should be noted that model selection consistency and prediction efficiency are 
obtained in quite different regimes of regularization, so that it is not possible to 
obtain both types of consistency with the same Lasso estimator (Shalev-Shwartz, 
Srebro and Zhang, 2010). For prediction consistency, the regularization parame- 
ter is easily chosen by cross-validation on the prediction error. For model selection 
consistency, the regularization coefficient should typically be much larger than for 
prediction consistency; but rather than trying to select an optimal regularization 
parameter in that case, it is more natural to consider the collection of models ob- 
tained along the regularization path and to apply usual model selection methods 
to choose the best model in the collection. One method that works reasonably 
well in practice, sometimes called "OLS hybrid" for the least squares loss (Efron 
et al., 2004), consists in refitting the different models without regularization and 
to choose the model with the best fit by cross-validation. 

In structured sparse situations, such high-dimensional phenomena can also be 
characterized. Essentially, if one can make the assumption that w* is compati- 
ble with the additional prior knowledge on the sparsity pattern encoded in the 
norm 0, then, some of the assumptions required for consistency can sometimes 
be relaxed (see Huang and Zhang, 2010; Jenatton, Audibert and Bach, 2011; 
Huang, Zhang and Metaxas, 2011; Bach, 2010), and faster rates can sometimes 
be obtained (Huang and Zhang, 2010; Huang, Zhang and Metaxas, 2011; Obozin- 
ski, Wainwright and Jordan, 2011; Negahban and Wainwright, 2011; Bach, 2009; 
Percival, 2012). However, one major difficulty that arises is that some of the 
conditions for recovery or to obtain fast rates of convergence depend on an in- 
tricate interaction between the sparsity pattern, the design matrix and the noise 
covariance, which leads in each case to sufficient conditions that are typically 
not directly comparable between different structured or unstructured cases (Je- 
natton, Audibert and Bach, 2011). Moreover, even if the sufficient conditions 
are satisfied simultaneously for the norms to be compared, sharper bounds on 
rates and sample complexities would still often be needed to characterize more 
accurately the improvement resulting from having a stronger structural a priori. 
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4. SPARSE PRINCIPAL COMPONENT ANALYSIS AND DICTIONARY 

LEARNING 

Unsupervised learning aims at extracting latent representations of the data 
that are useful for analysis, visualization, denoising or to extract relevant infor- 
mation to solve subsequently a supervised learning problem. Sparsity or struc- 
tured sparsity are essential to specify, on the representations, constraints that 
improve their identifiability and interpretability. 

4.1 Analysis and Synthesis Views of PC A 

Depending on how the latent representation is extracted or constructed from 
the data, it is useful to distinguish two points of view. This is illustrated well in 
the case of PCA. 

In the analysis view, PCA aims at finding sequentially a set of directions in 
space that explain the largest fraction of the variance of the data. This can be 
formulated as an iterative procedure in which a one-dimensional projection of 
the data with maximal variance is found first, then the data are projected on 
the orthogonal subspace (corresponding to a deflation of the covariance matrix), 
and the process is iterated. In the synthesis view, PCA aims at finding a set 
of vectors, or dictionary elements (in a terminology closer to signal processing) 
such that all observed signals admit a linear decomposition on that set with low 
reconstruction error. In the case of PCA, these two formulations lead to the same 
solution (an eigenvalue problem). However, in extensions of PCA, in which either 
the dictionary elements or the decompositions of signals are constrained to be 
sparse or structured, they lead to different algorithms with different solutions. 

The analysis interpretation leads to sequential formulations (d'Aspremont, 
Bach and El Ghaoui, 2008; Moghaddam, Weiss and Avidan, 2006; Jolliffe, Trendafilov 
and Uddin, 2003) that consider components one at a time and perform a de- 
flation of the covariance matrix at each step (see Mackey, 2009). The synthe- 
sis interpretation leads to non-convex global formulations (see, e.g., Zou, Hastie 
and Tibshirani, 2006; Moghaddam, Weiss and Avidan, 2006; Aharon, Elad and 
Bruckstein, 2006; Mairal et al., 2010) which estimate simultaneously all principal 
components, typically do not require the orthogonality of the components, and 
are referred to as matrix factorization problems (Singh and Gordon, 2008; Bach, 
Mairal and Ponce, 2008) in machine learning, and dictionary learning in signal 
processing (Olshausen and Field, 1996). 

While we could also impose structured sparse priors in the analysis view, we 
will consider from now on the synthesis view, that we will introduce with the 
terminology of dictionary learning. 

4.2 Dictionary Learning 

Given a matrix X G ig»"X" of n columns corresponding to n observations in W^, 
the dictionary learning problem is to find a matrix D G R™^*', called dictionary, 
such that each observation can be well approximated by a linear combination of 
the p columns (d'^)fcgjx.p| of D called the dictionary elements. If A G MP^"^ is 
the matrix of the linear combination coefficients or decomposition coefficients (or 
codes), with the k-ih column of A being the coefficients for the A;-th signal 
x'^, the matrix product DA is called a decomposition of X. 

Learning simultaneously the dictionary D and the coefficients A corresponds 
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to a matrix factorization problem (see Witten, Tibshirani and Hastie, 2009, and 
reference therein). 

As formulated by Bach, Mairal and Ponce (2008) or Witten, Tibshirani and 
Hastie (2009), it is natural, when learning a decomposition, to penalize or con- 
strain some norms or pseudo- norms of A and D, say and J^d respectively, to 
encode prior information — typically sparsity — about the decomposition of X. 
While in general the penalties could be defined globally on the matrices A and 
D, we assume that each column of D and A is penalized separately. This can be 
written as 



DeR^xp 

where the regularization parameter A > controls to which extent the dictionary 
is regularized. If we assume that both regularizations f^A and f^D are convex, 
problem (4.1) is convex with respect to A for fixed D and vice versa. It is how- 
ever not jointly convex in the pair (A,D), but alternating optimization schemes 
generally lead to good performance in practice. 

4.3 Imposing Sparsity 

The choice of the two norms and JId is crucial and heavily influences the 
behavior of dictionary learning. Without regularization, any solution (D, A) is 
such that DA is the best fixed-rank approximation of X, and the problem can 
be solved exactly with a classical PCA. When is the £i-norm and Od the 
^2-norm, we aim at finding a dictionary such that each signal x* admits a sparse 
decomposition on the dictionary. In this context, we are essentially looking for 
a basis where the data have sparse decompositions, a framework we refer to as 
sparse dictionary learning. On the contrary, when Qa is the £2-norm and 
the ^i-norm, the formulation induces sparse principal components, i.e., atoms 
with many zeros, a framework we refer to as sparse PCA. In Section 4.4 and 
Section 4.5, we replace the ^i-norm by structured norms introduced in Section 3, 
leading to structured versions of the above estimation problems. 

4.4 Adding Structures to Principal Components 

One of PCA's main shortcomings is that, even if it finds a small number of 
important factors, the factor themselves typically involve all original variables. 
In the last decade, several alternatives to PCA which find sparse and potentially 
interpretable factors have been proposed, notably non-negative matrix factoriza- 
tion (NMF) (Lee and Seung, 1999) and sparse PCA (SPCA) (Jolliffe, Trendafilov 
and Uddin, 2003; Zou, Hastie and Tibshirani, 2006; Zass and Shashua, 2007; 
Witten, Tibshirani and Hastie, 2009). 

However, in many applications, only constraining the size of the supports of the 
factors does not seem appropriate because the considered factors are not only ex- 
pected to be sparse but also to have a certain structure. In fact, the popularity of 
NMF for face image analysis owes essentially to the fact that the method happens 
to retrieve sets of variables that are partly localized on the face and capture some 
features or parts of the face which seem intuitively meaningful given our a priori. 
We might therefore gain in the quality of the factors induced by enforcing directly 
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this a priori in the matrix factorization constraints. More generally, it would be 
desirable to encode higher-order information about the supports that reflects the 
structure of the data. For example, in computer vision, features associated to the 
pixels of an image are naturally organized on a grid and the supports of factors 
explaining the variability of images could be expected to be localized, connected 
or have some other regularity with respect to that grid. Similarly, in genomics, 
factors explaining the gene expression patterns observed on a microarray could 
be expected to involve groups of genes corresponding to biological pathways or 
set of genes that are neighbors in a protein-protein interaction network. 

Based on these remarks and with the norms presented earlier, sparse PCA is 
readily extended to structured sparse PCA (SSPCA), which explains the variance 
of the data by factors that are not only sparse but also respect some a priori struc- 
tural constraints deemed relevant to model the data at hand: slight variants of 
the regularization term defined in Section 3 (with the groups defined in Figure 4) 
can be used successfully for J^d- 

Experiments on face recognition. By definition, dictionary learning belongs to 
unsupervised learning; in that sense, our method may appear first as a tool for 
exploratory data analysis, which leads us naturally to qualitatively analyze the 
results of our decompositions (e.g., by visualizing the learned dictionaries). This 
is obviously a difficult and subjective exercise, beyond the assessment of the 
consistency of the method in artificial examples where the "true" dictionary is 
known. For quantitative results, see Jenatton, Obozinski and Bach (2010).^ 

We apply SSPCA on the cropped AR Face Database (Martinez and Kak, 2001) 
that consists of 2600 face images, corresponding to 100 individuals (50 women 
and 50 men). For each subject, there are 14 non-occluded poses and 12 occluded 
ones (the occlusions are due to sunglasses and scarfs). We reduce the resolution 
of the images from 165 x 120 pixels to 38 x 27 pixels for computational reasons. 

Figure 7 shows examples of learned dictionaries (for p = 36 elements), for 
NMF, unstructured sparse PCA (SPCA), and SSPCA. While NMF finds sparse 
but spatially unconstrained patterns, SSPCA selects sparse convex areas that 
correspond to a more natural segment of faces. For instance, meaningful parts 
such as the mouth and the eyes are recovered by the dictionary. 

4.5 Hierarchical Dictionary Learning 

In this section, we consider sparse dictionary learning, where the structured 
sparse prior knowledge is put on the decomposition coefficients, i.e., the matrix A 
in Eq. (4.1), and present an application to text documents. 

Text documents. The goal of probabilistic topic models is to find a low-dimen- 
sional representation of a collection of documents, where the representation should 
provide a semantic description of the collection. Approaching the problem in a 
parametric Bayesian framework, latent Dirichlet allocation (LDA), Blei, Ng and 
Jordan (2003) models documents, represented as vectors of word counts, as a 
mixture of a predefined number of latent topics, defined as multinomial distribu- 
tions over a fixed vocabulary. The number of topics is usually small compared to 
the size of the vocabulary (e.g., 100 against 10000), so that the topic proportions 
of each document provide a compact representation of the corpus. 

'''A Matlab toolbox implementing our method can be downloaded from 
http : //www. di . ens . fr/~ jenatton/. 



imsart-sts ver. 2011/05/20 file: stat_science_structured_sparsity_final_versioii.tex date: April 23, 2012 



18 




BACH ET AL 





■ 






i 


HI 






■ 


B 


■ 


■■ 


BH 


■ 


I 


■ 


■■■ 


■ 












■ 


■ 




■I 


I 


■ 




■I 


1 


■■■■■ 






■HI 


i 


III 


22 



Fig 7: Top left, examples of faces in the datasets. The three remaining images rep- 
resent learned dictionaries of faces with p = 36: NMF (top right), SPCA (bottom 
left) and SSPCA (bottom right). The dictionary elements are sorted in decreasing 
order of explained variance. While NMF gives sparse spatially unconstrained pat- 
terns, SSPCA finds convex areas that correspond to more natural face segments. 
SSPCA captures the left /right illuminations and retrieves pairs of symmetric pat- 
terns. Some displayed patterns do not seem to be convex, e.g., nonzero patterns 
located at two opposite corners of the grid. However, a closer look at these dic- 
tionary elements shows that convex shapes are indeed selected, and that small 
numerical values (just as regularizing by ^2-iiorm may lead to) give the visual 
impression of having zeroes in convex nonzero patterns. This also shows that if 
a nonconvex pattern has to be selected, it will be, by considering its convex hull. 



In fact the problem addressed by LDA is fundamentally a matrix factorization 
problem. For instance, Buntine (2002) argued that LDA can be interpreted as 
a Dirichlet-multinomial counterpart of factor analysis. We can actually cast the 
problem in the dictionary learning formulation that we presented before^. Indeed, 



^Doing so we simply trade the multinomial likelihood with a least-square formulation. 
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suppose that the signals X = [x^, . . . ,x"] in M"*^" are each the so-called bag-of- 
word representation of each of n documents over a vocabulary of m words, i.e., 
X* is a vector whose fc-th component is the empirical frequency in document i 
of the fc-th word of a fixed lexicon. If we constrain the entries of D and A 
to be nonnegative, and the dictionary elements d-' to have unit ^i-norm, the 
decomposition (D, A) can be interpreted as the parameters of a topic-mixture 
model. Sparsity here ensures that a document is described by a small number of 
topics. 

Switching to structured sparsity allows in this case to organize automatically 
the dictionary of topics in the process of learning it. Assume that J^a in Eq. (4.1) 
is a tree-structured regularization, such as illustrated on Figure 5; in this case, 
in the light of Section 3.2, if the decomposition of a document involves a certain 
topic, then all ancestral topics in the tree are also present in the topic decompo- 
sition. Since the hierarchy is shared by all documents, the topics close to the root 
participate in every decomposition, and given that the dictionary is learned, this 
mechanism forces those topics to be quite generic — essentially gathering the lex- 
icon which is common to all documents. Conversely, the deeper the topics in the 
tree, the more specific they should be. It should be noted that such hierarchical 
dictionaries can also be obtained with generative probabilistic models, typically 
based on non-parametric Bayesian prior over trees or paths in trees, and which 
extend the LDA model to topic hierarchies (Blei, Griffiths and Jordan, 2010; 
Adams, Ghahramani and Jordan, 2010). 
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Fig 8: Example of a topic hierarchy estimated from 1714 NIPS proceedings pa- 
pers (from 1988 through 1999). Each node corresponds to a topic whose 5 most 
important words are displayed. Single characters such as n, t, r are part of the 
vocabulary and often appear in NIPS papers, and their place in the hierarchy is 
semantically relevant to children topics. 
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Visualization of NIPS proceedings. We qualitatively illustrate our approach on 
the NIPS proceedings from 1988 through 1999 (Griffiths and Steyvers, 2004). 
After removing words appearing fewer than 10 times, the dataset is composed of 
1714 articles, with a vocabulary of 8274 words. As explained above, we enforce 
both the dictionary and the sparse coefficients to be non-negative, and constrain 
the dictionary elements to have a unit £i-norm. Figure 8 displays an example of 
a learned dictionary with 13 topics, obtained by using a tree-structured penalty 
(see Section 3.2) on the coefficients A and by selecting manually'' A = 2~^^. 
As expected and similarly to Blei, Griffiths and Jordan (2010), we capture the 
stopwords at the root of the tree, and topics reflecting the different subdomains 
of the conference such as neurosciences, optimization or learning theory. 

5. HIGH-DIMENSIONAL NON-LINEAR VARIABLE SELECTION 

In this section, we show how structured sparsity-inducing norms may be used 
to provide an efficient solution to the problem of high-dimensional non-linear 
variable selection. Namely, given p variables x = (xi, . . . ,Xp), our aim is to find 
a non-linear function f(xi, . . . ,Xp) which depends only on a few variables. First 
approaches to the problem have considered restricted functional forms such as 
f (xi, . . . , Xp) = fi(xi) + • • • + fp(xp), where each fi, . . . , fp are univariate non- 
linear functions (Ravikumar et al., 2009; Bach, 2008). However, many non-linear 
functions cannot be expressed as sums of functions of these forms. Additional 
interactions have been added leading to functions of the form f(xi,...,Xp) = 
Sjc{i p} \J\ii2^ji^j) (.^^^ Zhang, 2006). While second-order interactions 
make the class of functions larger, our aim in this section is to consider functions 
which can be expressed as a sparse linear combination of the form f (xi, . . . , Xp) = 
J2jc{i p}^j(^j)^ i-^-' * combination of functions defined on potentially larger 
subsets of variables. 

The main difficulties associated with this problem are that (1) each function fj 
has to be estimated, leading to a non-parametric problem, and (2) there are expo- 
nentially many such functions. We propose however an approach that overcomes 
both difficulties in the next section, based on the ideas that estimating functions 
rather than vectors can be tackled with estimators in reproducing kernel Hilbert 
spaces (see Section 5.1), and that the complexity issues can be addressed by 
imposing some structure among all the subsets J C {1, . . . ,p} (see Section 5). 

5.1 Multiple Kernel Learning: From Linear to Non-Linear Predictions 

Reproducing kernel Hilbert spaces are arguably the simplest spaces for the 
non-parametric estimation of non-linear functions since most learning algorithms 
for linear models are directly ported to any RKHS via simple kernelization. We 
therefore start by reviewing learning from a single and later multiple reproducing 
kernels, since our approach will be based on combining functions from multiple 
(actually a hierarchy) of RKHSes. For more details, see Bach (2008). 

Single kernel learning. Let us assume that the n input data-points x^-*^) , • • • , x^") 
belong to a set X (not necessarily MP), and consider predictors of the form 

^The regularization parameter striking a good compromise between sparsity and reconstruc- 
tion of the data is chosen here by hand because (a) cross-vahdation would yield a significantly 
less sparse dictionary and (b) model selection criteria would not apply without serious caveats 
here since the dictionary is learned at the same time. 
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(/, ^*(x)) where ^ : X ^ is a map from the input space to a reproducing 
kernel Hilbert space J- (associated to the kernel function k), which we refer to as 
the feature space. These predictors are linearly parameterized, but may depend 
non-linearly on x. We consider the following estimation problem: 

1 " A 

i=l 

where || . || is the Hilbertian norm associated to J-". The representor theorem (Kimel- 
dorf and Wahba, 1971) states that, for all loss functions (potentially nonconvex), 
the solution f admits the expansion f = Y17=i o:i*^(^^*'*)) so that, replacing f by 
its new expression, we can now minimize 

1 " A 

i=l 

where K is the kernel matrix, an n x n matrix whose element is equal to 
($(x^*)), $(x(-''^)) = k{x^^\x.^^^). This optimization problem involves the observa- 
tions . . . , x^") only through the kernel matrix K, and can thus be solved, as 
long as K can be evaluated efficiently. See Shawe- Taylor and Cristianini (2004) 
for more details. 

Multiple kernel learning (MKL). We can now assume that we are given m 
Hilbert spaces J'j-, j = 1, . . . , m, and look for predictors of the form f(x) = gi(x) + 
■ • • + gm(x), where* each gj G Fj. In order to have many gj equal to zero, we can 
penalize f using a sum of norms similar to the group Lasso penalties introduced 
earlier, namely ||gj||j"j- This leads to selection of functions. Moreover, it 

turns out that the optimization problems may be expressed also in terms of 
the m kernel matrices, and it is equivalent to learn a sparse linear combination 
K = ^jLi ^jKj (with many r^j's equal to zero) of kernel matrices with then cx 

solution of the single kernel learning problem for K. For more details, see Bach 
(2008). 

From MKL to sparse generalized additive models. As shown above, the MKL 
framework is defined with any set of m RKHSes defined on the same base set X. 
When the base set is itself defined as a cartesian product of p base sets, i.e., X = 
A'l X • • • X Xp, then it is common to consider m = p RKHSes which are each of them 
defined on a single Xi, leading to the desired functional form fi(xi) + • • • + fp(xp). 
To overcome the limitation of this functional form we need to consider a more 
complex expansion. 

5.2 Hierarchical Kernel Learning 

In this section, we consider functional expansions with up to m = 2^ terms 
corresponding to different RKHSes, each defined on a cartesian product of a sub- 
set of the p separate input spaces. Specifically, we consider functions of the form 
f(xi, . . . ,Xp) = p} fj(xj) with fj chosen to live in a RKHS Tj defined 

on variables xj. Penalizing by the norm X^j^-ji p} II Oil J'j would in theory lead 
to an appropriate selection of functions from the various RKHSes (and to learn- 
ing a sparse linear combination of the corresponding kernel matrices). However, 
in practice, there are 2^ such predictors, which is not algorithmically feasible. 

^Notice that the function gj is not restricted to depend only on a subpart of x as before, 
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Fig 9: Power set of the set {1, . . . , 4}: in blue, an authorized set of selected subsets. 
In red, an example of a group used within the norm (a subset and all of its 
descendants in the DAG). 



This is where structured sparsity comes into play. In order to obtain polynomial- 
time algorithms and theoretically controlled predictive performance, we may add 
an extra constraint to the problem. Namely, we endow the power set of {1, ... ,p} 
with the partial order of the inclusion of sets, and in this directed acyclic graph 
(DAG), we require that predictors f select a subset only after all of its ances- 
tors have been selected. This can be achieved in a convex formulation using a 
structured-sparsity inducing norm of the type presented in Section 3.2, but de- 
fined by a hierarchy of groups as follows 

JC{l,...,p} ^HDJ 

As illustrated in Figure 9, this norm corresponds to overlapping groups of vari- 
ables defined on the directed acyclic graphs of all subsets of {1, . . . We will 
explain briefly how introducing this norm may lead to polynomial time algo- 
rithms and what theoretical guarantees are associated with it. Illustrations of 
the application of hierarchical kernel learning to real data can be found in Bach 
(2009). 

Polynomial-time estimation algorithm. While we are, a priori, still facing an 
estimation problem with 2^ functions, it can be solved using an active set method, 
which considers adding a component fj G J^j (resp. Kj) to the active set of 
predictors (resp. kernels). The two crucial aspects are (1) to add the right kernel, 
i.e., choose among the 2^ which one to add, and (2) when to stop. As shown in 
Bach (2009), these steps may be carried out efficiently for certain collections of 
RKHSes J-'j, in particular those for which we are able to compute efficiently (i.e., 
in polynomial time in p) the sum X]jc{i p} This is the case, for example, 
for Gaussian kernels kj{-x.j,-K'j) = exp(— 7||xj — XjUg'). 

Theoretical analysis. Bach (2009) showed that under appropriate assumptions, 
estimation under high-dimensional scaling, i.e., for p ^ n but logp = 0{n), is 
possible in this situation, in spite of the fact that the number of terms in the 
expansion is now potentially doubly exponential in n. 

6. CONCLUSION 

In this paper, we reviewed several approaches for structured sparsity, based 
on convex optimization and the design of appropriate sparsity- inducing norms. 
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Analyses and algorithms for the traditional £i-norm can readily be extended to 
these new norms, making them an efficient and flexible tools for introducing prior 
knowledge in high-dimensional statistical problems. We also presented several ap- 
plications to supervised and unsupervised learning problems, where the proper 
use of additional knowledge leads to improved interpretability of the sparse esti- 
mates and/or increased predictive performance. 
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